{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dacc824e",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
    "        <br>\n",
    "        <br>\n",
    "        <a href=\"https://arize-phoenix.readthedocs.io/projects/evals/en/latest/\">Evals Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\">Arize Phoenix Evals 2.0</h1>\n",
    "\n",
    "We are excited to introduce `arize-phoenix-evals` 2.0, an open-source library providing tools to evaluate AI systems so you can build faster and with more confidence. We have rebuilt the library from the ground up to make evaluation faster, easier, and more powerful.\n",
    "\n",
    "**In this notebook, you will learn more about:**\n",
    "\n",
    "1. Our guiding principles\n",
    "2. The core library abstractions\n",
    "3. Usage examples\n",
    "4. What's changed between 2.0 and the previous version\n",
    "\n",
    "#### Our Guiding Principles\n",
    "\n",
    "**Fast:** We are optimizing for maximum speed, minimal headache.\n",
    "\n",
    "**Ergonomic:** It should be user-friendly and easy to pick up.\n",
    "\n",
    "**Flexible:** We make minimal assumptions about the shape of your data or evals.\n",
    "\n",
    "**Powerful:** Built with extensibility in mind, the library enables complex evaluation tasks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "aa7312e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install arize-phoenix \"arize-phoenix-evals>=2.0.0\" openai pandas openinference-instrumentation-openai --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ebb216a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "⚠️ PHOENIX_COLLECTOR_ENDPOINT is set to https://app.phoenix.arize.com/s/ehutton.\n",
      "⚠️ This means that traces will be sent to the collector endpoint and not this app.\n",
      "⚠️ If you would like to use this app to view traces, please unset this environmentvariable via e.g. `del os.environ['PHOENIX_COLLECTOR_ENDPOINT']` \n",
      "⚠️ You will need to restart your notebook to apply this change.\n",
      "/Users/elizabethhutton/.local/share/uv/python/cpython-3.9.23-macos-aarch64-none/lib/python3.9/contextlib.py:126: SAWarning: Skipped unsupported reflection of expression-based index ix_cumulative_llm_token_count_total\n",
      "  next(self.gen)\n",
      "/Users/elizabethhutton/.local/share/uv/python/cpython-3.9.23-macos-aarch64-none/lib/python3.9/contextlib.py:126: SAWarning: Skipped unsupported reflection of expression-based index ix_latency\n",
      "  next(self.gen)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🌍 To view the Phoenix app in your browser, visit http://localhost:6006/\n",
      "📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/elizabethhutton/Projects/phoenix/src/phoenix/otel/otel.py:434: UserWarning: Could not infer collector endpoint protocol, defaulting to HTTP.\n",
      "  warnings.warn(\"Could not infer collector endpoint protocol, defaulting to HTTP.\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔭 OpenTelemetry Tracing Details 🔭\n",
      "|  Phoenix Project: default\n",
      "|  Span Processor: SimpleSpanProcessor\n",
      "|  Collector Endpoint: https://app.phoenix.arize.com/s/ehutton/v1/traces\n",
      "|  Transport: HTTP + protobuf\n",
      "|  Transport Headers: {'authorization': '****'}\n",
      "|  \n",
      "|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.\n",
      "|  \n",
      "|  ⚠️ WARNING: It is strongly advised to use a BatchSpanProcessor in production environments.\n",
      "|  \n",
      "|  `register` has set this TracerProvider as the global OpenTelemetry default.\n",
      "|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# set up phoenix app and tracing\n",
    "import phoenix as px\n",
    "from phoenix.otel import register\n",
    "\n",
    "px.launch_app()\n",
    "tracer_provider = register(auto_instrument=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b026cceb",
   "metadata": {},
   "source": [
    "## LLM Configuration\n",
    "\n",
    "**Core Design Principle:** The library should work with any LLM model and provider.\n",
    "\n",
    "The LLM wrapper unifies generation tasks across model providers by delegating to the most commonly installed client SDKs (OpenAI, LangChain, LiteLLM) via adapters.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "037001c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "dd52ac96",
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.evals import LLM\n",
    "\n",
    "llm = LLM(\n",
    "    provider=\"openai\", model=\"gpt-4o-mini\"\n",
    ")  # you could also specify the client e.g. \"langchain\", \"litellm\", \"openai\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "632848d2",
   "metadata": {},
   "source": [
    "## Evaluators and Scores\n",
    "\n",
    "An evaluation is defined as any process that returns a `Score`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3080621a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hallucination result:\n",
      "{\n",
      "  \"name\": \"hallucination\",\n",
      "  \"score\": 1.0,\n",
      "  \"label\": \"factual\",\n",
      "  \"explanation\": \"The response states that Paris is the capital of France, which is accurate according to the provided context that clearly states Paris is the capital and largest city of France.\",\n",
      "  \"metadata\": {\n",
      "    \"model\": \"gpt-4o-mini\"\n",
      "  },\n",
      "  \"kind\": \"llm\",\n",
      "  \"direction\": \"maximize\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "from phoenix.evals.metrics import (\n",
    "    HallucinationEvaluator,\n",
    ")\n",
    "\n",
    "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n",
    "hallucination_evaluator = HallucinationEvaluator(llm=llm)\n",
    "result = hallucination_evaluator.evaluate(\n",
    "    {\n",
    "        \"input\": \"What is the capital of France?\",\n",
    "        \"output\": \"Paris is the capital of France.\",\n",
    "        \"context\": \"Paris is the capital and largest city of France.\",\n",
    "    }\n",
    ")\n",
    "print(\"Hallucination result:\")\n",
    "result[0].pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b03f84a5",
   "metadata": {},
   "source": [
    "**Core Design Principle:** The output of evaluators should be rich with information.\n",
    "\n",
    "Evaluators always return a **list** of `Score` objects. Often, this will be a list of length 1, but some evaluators may return multiple scores for a single `eval_input` (e.g. precision/recall or multi-criteria evals).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b220eca8",
   "metadata": {},
   "source": [
    "## Built-In Metrics\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44004e3b",
   "metadata": {},
   "source": [
    "### Precision, Recall, F1 (multi-score)\n",
    "\n",
    "A single evaluator can return multiple scores!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a465ddb9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Results:\n",
      "Score(name='precision', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, kind='code', direction='maximize')\n",
      "Score(name='recall', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, kind='code', direction='maximize')\n",
      "Score(name='f1', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, kind='code', direction='maximize')\n"
     ]
    }
   ],
   "source": [
    "from phoenix.evals.metrics import PrecisionRecallFScore\n",
    "\n",
    "precision_recall_fscore = PrecisionRecallFScore(positive_label=\"yes\")\n",
    "result = precision_recall_fscore.evaluate(\n",
    "    {\"output\": [\"no\", \"yes\", \"yes\"], \"expected\": [\"yes\", \"no\", \"yes\"]}\n",
    ")\n",
    "print(\"Results:\")\n",
    "print(result[0])\n",
    "print(result[1])\n",
    "print(result[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05595bd1",
   "metadata": {},
   "source": [
    "## Custom LLM Classification Evaluators\n",
    "\n",
    "This is similar to `llm_classify`, for LLM-as-a-judge evaluations that output a label and explanation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "150d7c6d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"name\": \"sentiment\",\n",
      "  \"score\": 1.0,\n",
      "  \"label\": \"positive\",\n",
      "  \"explanation\": \"The text expresses a strong positive emotion towards something, indicating enjoyment or satisfaction.\",\n",
      "  \"metadata\": {\n",
      "    \"model\": \"gpt-4o-mini\"\n",
      "  },\n",
      "  \"kind\": \"llm\",\n",
      "  \"direction\": \"maximize\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "from phoenix.evals import LLM, ClassificationEvaluator\n",
    "\n",
    "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n",
    "\n",
    "evaluator = ClassificationEvaluator(\n",
    "    name=\"sentiment\",\n",
    "    llm=llm,\n",
    "    prompt_template=\"Classify the sentiment of this text: {text}\",\n",
    "    choices={\"positive\": 1.0, \"negative\": 0.0, \"neutral\": 0.5},  # specify custom score mapping!\n",
    ")\n",
    "\n",
    "result = evaluator.evaluate({\"text\": \"I love this!\"})\n",
    "result[0].pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25e181fd",
   "metadata": {},
   "source": [
    "### About the `ClassificationEvaluator`\n",
    "\n",
    "**New features:**\n",
    "\n",
    "- Specify scores for each label\n",
    "- Runs on single records (not just a dataframe)\n",
    "- Leverages model tool calling / structured output for more reliable output parsing\n",
    "- There is also a factory function `create_classifier` to create `ClassificationEvaluator` objects.\n",
    "\n",
    "This abstraction can be easily extended to support multi-criteria evaluations where a judge is asked to evaluate an input across multiple dimensions in one request.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "638889b9",
   "metadata": {},
   "source": [
    "## Input Mapping and Transformation\n",
    "\n",
    "**Core Design Principle:** The inputs to an evaluator should be well-defined and discoverable.\n",
    "\n",
    "Every evaluator has an `input_schema` which describes what inputs it expects.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e808dbbb",
   "metadata": {},
   "source": [
    "### Use `.describe()` to inspect an `Evaluator`'s input schema\n",
    "\n",
    "Because pydantic `BaseModel` is used for the `input_schema`, input fields can be annotated with types, descriptions, and even aliases.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "343098ea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'hallucination',\n",
       " 'kind': 'llm',\n",
       " 'direction': 'maximize',\n",
       " 'input_schema': {'properties': {'input': {'description': 'The input query.',\n",
       "    'title': 'Input',\n",
       "    'type': 'string'},\n",
       "   'output': {'description': 'The response to the query.',\n",
       "    'title': 'Output',\n",
       "    'type': 'string'},\n",
       "   'context': {'description': 'The context or reference text.',\n",
       "    'title': 'Context',\n",
       "    'type': 'string'}},\n",
       "  'required': ['input', 'output', 'context'],\n",
       "  'title': 'HallucinationInputSchema',\n",
       "  'type': 'object'}}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hallucination_evaluator.describe()  # requires strings for input, output, and context"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "aa53db77",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'exact_match',\n",
       " 'kind': 'code',\n",
       " 'direction': 'maximize',\n",
       " 'input_schema': {'properties': {'output': {'title': 'Output',\n",
       "    'type': 'string'},\n",
       "   'expected': {'title': 'Expected', 'type': 'string'}},\n",
       "  'required': ['output', 'expected'],\n",
       "  'title': 'Exact_matchInput',\n",
       "  'type': 'object'}}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from phoenix.evals.metrics import exact_match\n",
    "\n",
    "exact_match.describe()  # requires string output and expected"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb78bcd1",
   "metadata": {},
   "source": [
    "### Use `input_mapping` to map/transform data into expected `input_schema`\n",
    "\n",
    "An evaluator's input arguments may not perfectly match those in your example or dataset. Or, you may want to run multiple evaluators on the same example, but they have different or conflicting `input_schema`'s.\n",
    "\n",
    "You may have noticed that `Evaluators` accept an `eval_input` payload rather than keyword arguments.\n",
    "\n",
    "**Core Design Principle:** You should not have to modify your data to run evaluations.\n",
    "\n",
    "To extract the values from a nested `eval_input` payload, provide an `input_mapping` that maps evaluator's input fields to a path spec in your original data.\n",
    "\n",
    "**Possible Mapping Values:**\n",
    "\n",
    "- top-level keys in your JSON\n",
    "- a path spec following JSON path syntax\n",
    "- callable functions\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "53f9a351",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"name\": \"hallucination\",\n",
      "  \"score\": 0.0,\n",
      "  \"label\": \"hallucinated\",\n",
      "  \"explanation\": \"The evaluation cannot be determined as the provided data does not contain specific details in the query, context, or response to analyze for factual accuracy or hallucination. More information is needed.\",\n",
      "  \"metadata\": {\n",
      "    \"model\": \"gpt-4o-mini\"\n",
      "  },\n",
      "  \"kind\": \"llm\",\n",
      "  \"direction\": \"maximize\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "# example nested eval input for a RAG system\n",
    "eval_input = {\n",
    "    \"input\": {\"query\": \"user input query\"},\n",
    "    \"output\": {\n",
    "        \"responses\": [\"model answer\", \"model answer 2\"],\n",
    "        \"documents\": [\"doc A\", \"doc B\"],\n",
    "    },\n",
    "    \"expected\": \"correct answer\",\n",
    "}\n",
    "\n",
    "# in order to run the hallucination evaluator, we need to process the eval_input to the fit the input schema\n",
    "input_mapping = {\n",
    "    \"input\": \"input.query\",  # dot notation to access nested keys\n",
    "    \"output\": \"output.responses[0]\",  # brackets to access list elements\n",
    "    \"context\": lambda x: \" \".join(\n",
    "        x[\"output\"][\"documents\"]\n",
    "    ),  # lambda function to combine the document chunks\n",
    "}\n",
    "\n",
    "# the evaluator uses the input_mapping to transform the eval_input into the expected input schema\n",
    "result = hallucination_evaluator.evaluate(eval_input, input_mapping)\n",
    "result[0].pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09d15d39",
   "metadata": {},
   "source": [
    "### Use `bind_evaluator` to bind an `input_mapping` to an `Evaluator` for reuse\n",
    "\n",
    "Note: We don't need to remap \"expected\" for the `exact_match` eval because it already exists in our `eval_input`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "8eb96ffd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"name\": \"hallucination\",\n",
      "  \"score\": 0.0,\n",
      "  \"label\": \"hallucinated\",\n",
      "  \"explanation\": \"The provided task does not contain specific details in the query, context, or response that allow us to assess the accuracy of the response. Without specific text from the query, context, or response, it's impossible to determine whether it is factual or hallucinated. Therefore, I cannot provide a definitive answer.\",\n",
      "  \"metadata\": {\n",
      "    \"model\": \"gpt-4o-mini\"\n",
      "  },\n",
      "  \"kind\": \"llm\",\n",
      "  \"direction\": \"maximize\"\n",
      "}\n",
      "{\n",
      "  \"name\": \"exact_match\",\n",
      "  \"score\": 0.0,\n",
      "  \"metadata\": {},\n",
      "  \"kind\": \"code\",\n",
      "  \"direction\": \"maximize\"\n",
      "}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[None, None]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from phoenix.evals import bind_evaluator\n",
    "\n",
    "# we can bind an input_mapping to an evaluator ahead of call time for easier sequential evals\n",
    "evaluators = [\n",
    "    bind_evaluator(evaluator=hallucination_evaluator, input_mapping=input_mapping),\n",
    "    bind_evaluator(evaluator=exact_match, input_mapping={\"output\": \"output.responses[0]\"}),\n",
    "]\n",
    "scores = []\n",
    "for evaluator in evaluators:\n",
    "    scores.append(evaluator.evaluate(eval_input))  # no need to pass input_mapping each time\n",
    "\n",
    "[score[0].pretty_print() for score in scores]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf3b5a01",
   "metadata": {},
   "source": [
    "## A Convenient Decorator\n",
    "\n",
    "Use the `create_evaluator` decorator to turn any function that returns something \"score-like\" into an `Evaluator`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a374439",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Score(name='text_length', score=0.5, label='short', explanation=None, metadata={}, kind='code', direction='maximize')]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from phoenix.evals import create_evaluator\n",
    "\n",
    "\n",
    "# heuristic/code evaluator that returns a tuple of score and label\n",
    "@create_evaluator(name=\"text_length\")\n",
    "def text_length_score(text: str) -> tuple[float, str]:\n",
    "    \"\"\"Score text based on length (longer = better, up to a point)\"\"\"\n",
    "    length = len(text)\n",
    "    if length < 10:\n",
    "        score = 0.0\n",
    "        label = \"too_short\"\n",
    "    elif length < 50:\n",
    "        score = 0.5\n",
    "        label = \"short\"\n",
    "    elif length < 200:\n",
    "        score = 1.0\n",
    "        label = \"good_length\"\n",
    "    else:\n",
    "        score = 0.8\n",
    "        label = \"too_long\"\n",
    "\n",
    "    return (score, label)\n",
    "\n",
    "\n",
    "text_length_score.evaluate({\"text\": \"This is a test\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69e57a21",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'keyword_presence',\n",
       " 'kind': 'code',\n",
       " 'direction': 'maximize',\n",
       " 'input_schema': {'properties': {'text': {'title': 'Text', 'type': 'string'},\n",
       "   'keywords': {'items': {'type': 'string'},\n",
       "    'title': 'Keywords',\n",
       "    'type': 'array'}},\n",
       "  'required': ['text', 'keywords'],\n",
       "  'title': 'Keyword_presenceInput',\n",
       "  'type': 'object'}}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from phoenix.evals import Score, create_evaluator\n",
    "\n",
    "\n",
    "# heuristic/code evaluator that returns a Score object with metadata\n",
    "@create_evaluator(name=\"keyword_presence\", kind=\"code\", direction=\"maximize\")\n",
    "def keyword_presence_score(text: str, keywords: list[str]) -> tuple[float, str, str]:\n",
    "    \"\"\"Score text based on presence of keywords\"\"\"\n",
    "    text_lower = text.lower()\n",
    "    keyword_list = keywords\n",
    "\n",
    "    found_keywords = [k for k in keyword_list if k in text_lower]\n",
    "    score = len(found_keywords) / len(keyword_list) if keyword_list else 0.0\n",
    "\n",
    "    return Score(\n",
    "        score=score,\n",
    "        label=f\"found_{len(found_keywords)}_of_{len(keyword_list)}\",\n",
    "        explanation=f\"Found keywords: {found_keywords}\",\n",
    "        metadata={\"found_keywords\": found_keywords, \"total_keywords\": len(keyword_list)},\n",
    "    )\n",
    "\n",
    "\n",
    "keyword_presence_score.describe()  # input schema is inferred from the function signature"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f562301",
   "metadata": {},
   "source": [
    "## Dataframe Evaluation\n",
    "\n",
    "Run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with two added columns per score:\n",
    "\n",
    "1. `{score_name}_score` contains the JSON serialized score (or None if the evaluation failed)\n",
    "2. `{evaluator_name}_execution_details` contains information about the execution status, duration, and any exceptions that ocurred.\n",
    "\n",
    "**Notes:**\n",
    "\n",
    "- use `bind_evaluator` to bind `input_mappings` to your evaluators so they match your dataframe columns.\n",
    "\n",
    "### Example 1: Async version with multiple evaluators\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "ebb79bd4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "74e5fc58d5c94f13a71aa8bf0e0f808d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Evaluating Dataframe |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>output</th>\n",
       "      <th>expected</th>\n",
       "      <th>context</th>\n",
       "      <th>query</th>\n",
       "      <th>response</th>\n",
       "      <th>hallucination_execution_details</th>\n",
       "      <th>exact_match_execution_details</th>\n",
       "      <th>hallucination_score</th>\n",
       "      <th>exact_match_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Yes</td>\n",
       "      <td>Yes</td>\n",
       "      <td>This is a test</td>\n",
       "      <td>What is the name of this test?</td>\n",
       "      <td>First test</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'name': 'hallucination', 'score': 0.0, 'label...</td>\n",
       "      <td>{'name': 'exact_match', 'score': 1.0, 'metadat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Yes</td>\n",
       "      <td>No</td>\n",
       "      <td>This is another test</td>\n",
       "      <td>What is the name of this test?</td>\n",
       "      <td>Another test</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'name': 'hallucination', 'score': 1.0, 'label...</td>\n",
       "      <td>{'name': 'exact_match', 'score': 0.0, 'metadat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>No</td>\n",
       "      <td>No</td>\n",
       "      <td>This is a third test</td>\n",
       "      <td>What is the name of this test?</td>\n",
       "      <td>Third test</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'name': 'hallucination', 'score': 1.0, 'label...</td>\n",
       "      <td>{'name': 'exact_match', 'score': 1.0, 'metadat...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  output expected               context                           query  \\\n",
       "0    Yes      Yes        This is a test  What is the name of this test?   \n",
       "1    Yes       No  This is another test  What is the name of this test?   \n",
       "2     No       No  This is a third test  What is the name of this test?   \n",
       "\n",
       "       response                    hallucination_execution_details  \\\n",
       "0    First test  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "1  Another test  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "2    Third test  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "\n",
       "                       exact_match_execution_details  \\\n",
       "0  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "1  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "2  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "\n",
       "                                 hallucination_score  \\\n",
       "0  {'name': 'hallucination', 'score': 0.0, 'label...   \n",
       "1  {'name': 'hallucination', 'score': 1.0, 'label...   \n",
       "2  {'name': 'hallucination', 'score': 1.0, 'label...   \n",
       "\n",
       "                                   exact_match_score  \n",
       "0  {'name': 'exact_match', 'score': 1.0, 'metadat...  \n",
       "1  {'name': 'exact_match', 'score': 0.0, 'metadat...  \n",
       "2  {'name': 'exact_match', 'score': 1.0, 'metadat...  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from phoenix.evals import LLM, async_evaluate_dataframe, bind_evaluator\n",
    "from phoenix.evals.metrics import HallucinationEvaluator, exact_match\n",
    "\n",
    "exact_match._input_mapping = {}  # unset the input mapping from earlier\n",
    "\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"output\": [\"Yes\", \"Yes\", \"No\"],\n",
    "        \"expected\": [\"Yes\", \"No\", \"No\"],\n",
    "        \"context\": [\"This is a test\", \"This is another test\", \"This is a third test\"],\n",
    "        \"query\": [\n",
    "            \"What is the name of this test?\",\n",
    "            \"What is the name of this test?\",\n",
    "            \"What is the name of this test?\",\n",
    "        ],\n",
    "        \"response\": [\"First test\", \"Another test\", \"Third test\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n",
    "\n",
    "hallucination_evaluator = bind_evaluator(\n",
    "    evaluator=HallucinationEvaluator(llm=llm),\n",
    "    input_mapping={\"input\": \"query\", \"output\": \"response\"},\n",
    ")\n",
    "\n",
    "result = await async_evaluate_dataframe(\n",
    "    dataframe=df, evaluators=[hallucination_evaluator, exact_match]\n",
    ")\n",
    "result.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc19f57c",
   "metadata": {},
   "source": [
    "### Example 2: Sync version with multi-score evaluator\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e6544f39",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7d0a0b12f05c4e99bff26c39e5e3a997",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Evaluating Dataframe |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>output</th>\n",
       "      <th>expected</th>\n",
       "      <th>precision_recall_fscore_execution_details</th>\n",
       "      <th>precision_score</th>\n",
       "      <th>recall_score</th>\n",
       "      <th>f1_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[Yes, Yes, No]</td>\n",
       "      <td>[Yes, No, No]</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'name': 'precision', 'score': 0.5, 'metadata'...</td>\n",
       "      <td>{'name': 'recall', 'score': 1.0, 'metadata': {...</td>\n",
       "      <td>{'name': 'f1', 'score': 0.6666666666666666, 'm...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[Yes, No, No]</td>\n",
       "      <td>[Yes, No, No]</td>\n",
       "      <td>{'status': 'COMPLETED', 'exceptions': [], 'exe...</td>\n",
       "      <td>{'name': 'precision', 'score': 1.0, 'metadata'...</td>\n",
       "      <td>{'name': 'recall', 'score': 1.0, 'metadata': {...</td>\n",
       "      <td>{'name': 'f1', 'score': 1.0, 'metadata': {'bet...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           output       expected  \\\n",
       "0  [Yes, Yes, No]  [Yes, No, No]   \n",
       "1   [Yes, No, No]  [Yes, No, No]   \n",
       "\n",
       "           precision_recall_fscore_execution_details  \\\n",
       "0  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "1  {'status': 'COMPLETED', 'exceptions': [], 'exe...   \n",
       "\n",
       "                                     precision_score  \\\n",
       "0  {'name': 'precision', 'score': 0.5, 'metadata'...   \n",
       "1  {'name': 'precision', 'score': 1.0, 'metadata'...   \n",
       "\n",
       "                                        recall_score  \\\n",
       "0  {'name': 'recall', 'score': 1.0, 'metadata': {...   \n",
       "1  {'name': 'recall', 'score': 1.0, 'metadata': {...   \n",
       "\n",
       "                                            f1_score  \n",
       "0  {'name': 'f1', 'score': 0.6666666666666666, 'm...  \n",
       "1  {'name': 'f1', 'score': 1.0, 'metadata': {'bet...  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from phoenix.evals import evaluate_dataframe\n",
    "from phoenix.evals.metrics import PrecisionRecallFScore\n",
    "\n",
    "precision_recall_fscore = PrecisionRecallFScore(positive_label=\"Yes\")\n",
    "\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"output\": [[\"Yes\", \"Yes\", \"No\"], [\"Yes\", \"No\", \"No\"]],\n",
    "        \"expected\": [[\"Yes\", \"No\", \"No\"], [\"Yes\", \"No\", \"No\"]],\n",
    "    }\n",
    ")\n",
    "\n",
    "result = evaluate_dataframe(dataframe=df, evaluators=[precision_recall_fscore])\n",
    "result.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4562865b",
   "metadata": {},
   "source": [
    "# Practice: BYO Judge\n",
    "\n",
    "**Your task:** Create a custom LLM judge to classify text complexity. Inputs can be classified into one of the following labels: simple, moderate, or complex. For your use case, simple text is better than moderate or complex.\n",
    "\n",
    "Use the following 3 examples to test your new evaluator:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "b303d2b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "    {\n",
    "        \"text\": \"AI is when computers learn to do things like people, like recognizing faces or playing games.\"\n",
    "    },\n",
    "    {\n",
    "        \"text\": \"Machine learning is a method in artificial intelligence where systems improve their performance by learning from data, without being explicitly programmed for each task\"\n",
    "    },\n",
    "    {\n",
    "        \"text\": \"Artificial intelligence systems employing deep reinforcement learning utilize hierarchical neural architectures to iteratively optimize policy gradients across high-dimensional state-action spaces, converging toward sub-optimal equilibria in stochastic environments via backpropagated reward signals and temporally extended credit assignment mechanisms.\"\n",
    "    },\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99a0c630",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your judge here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6153d5fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# test your judge on the examples here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "663a6682",
   "metadata": {},
   "source": [
    "# Practice: BYO Code Evaluator\n",
    "\n",
    "**Your task:** Turn the following function into an Evaluator that calculates the Levenshtein distance between two strings.\n",
    "\n",
    "Note: Smaller values indicate higher similarity (lower score = better).\n",
    "\n",
    "Run the Evaluator on the following data:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "f7128859",
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_input = {\n",
    "    \"input\": {\"query\": \"What is the capital of France?\"},\n",
    "    \"output\": {\"response\": \"It is Paris\"},\n",
    "    \"expected\": \"Paris\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd1112c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# turn this function into a code evaluator\n",
    "def levenshtein_distance(s1: str, s2: str) -> int:\n",
    "    \"\"\"\n",
    "    Compute the Levenshtein distance between two strings s1 and s2.\n",
    "    \"\"\"\n",
    "    m, n = len(s1), len(s2)\n",
    "\n",
    "    dp = [[0] * (n + 1) for _ in range(m + 1)]\n",
    "\n",
    "    for i in range(m + 1):\n",
    "        dp[i][0] = i\n",
    "    for j in range(n + 1):\n",
    "        dp[0][j] = j\n",
    "\n",
    "    for i in range(1, m + 1):\n",
    "        for j in range(1, n + 1):\n",
    "            cost = 0 if s1[i - 1] == s2[j - 1] else 1\n",
    "            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)\n",
    "\n",
    "    return dp[m][n]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a71c057a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# test your evaluator on the example above.\n",
    "# hint: use an input_mapping to map/transform the input to the function's expected arguments."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.23"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
